FIFA is one of the best-known video games and the most famous sports title in the industry; here we consider the FIFA 22 edition. Each player occupies a specific position on the field, and our goal is to build models that classify a player's position from the values of his attributes. It is important to consider that some players share features with footballers who play in other positions, which may affect our task. For example, some attacking midfielders (CAM) have good shooting and pace, just like wingers (RW, LW). We will take this into account and adjust our classification accordingly.
The original dataset has been extracted from https://sofifa.com/ and contains 19239 players described by 110 different features.
(Package start-up messages omitted. The session attaches factoextra, dplyr, MASS, gmodels, e1071, tidyverse 1.3.1, corrplot 0.92, gridExtra, reshape, caret, randomForest 4.7-1.1 and cvms, among others. Two of the reported conflicts are worth remembering: MASS::select() masks dplyr::select(), and reshape::rename() masks dplyr::rename().)
We set the seed so that the experiments are reproducible.
set.seed(123)
First we load the dataset and check its dimensions.
players_full <- read.csv("E:/horatiu/Documents/players_22.csv") #full dataframe
dim(players_full) #full dataset
## [1] 19239 110
We have 19,239 players with 110 attributes each. Below we look at how those attributes are named.
colnames(players_full)
## [1] "sofifa_id" "player_url"
## [3] "short_name" "long_name"
## [5] "player_positions" "overall"
## [7] "potential" "value_eur"
## [9] "wage_eur" "age"
## [11] "dob" "height_cm"
## [13] "weight_kg" "club_team_id"
## [15] "club_name" "league_name"
## [17] "league_level" "club_position"
## [19] "club_jersey_number" "club_loaned_from"
## [21] "club_joined" "club_contract_valid_until"
## [23] "nationality_id" "nationality_name"
## [25] "nation_team_id" "nation_position"
## [27] "nation_jersey_number" "preferred_foot"
## [29] "weak_foot" "skill_moves"
## [31] "international_reputation" "work_rate"
## [33] "body_type" "real_face"
## [35] "release_clause_eur" "player_tags"
## [37] "player_traits" "pace"
## [39] "shooting" "passing"
## [41] "dribbling" "defending"
## [43] "physic" "attacking_crossing"
## [45] "attacking_finishing" "attacking_heading_accuracy"
## [47] "attacking_short_passing" "attacking_volleys"
## [49] "skill_dribbling" "skill_curve"
## [51] "skill_fk_accuracy" "skill_long_passing"
## [53] "skill_ball_control" "movement_acceleration"
## [55] "movement_sprint_speed" "movement_agility"
## [57] "movement_reactions" "movement_balance"
## [59] "power_shot_power" "power_jumping"
## [61] "power_stamina" "power_strength"
## [63] "power_long_shots" "mentality_aggression"
## [65] "mentality_interceptions" "mentality_positioning"
## [67] "mentality_vision" "mentality_penalties"
## [69] "mentality_composure" "defending_marking_awareness"
## [71] "defending_standing_tackle" "defending_sliding_tackle"
## [73] "goalkeeping_diving" "goalkeeping_handling"
## [75] "goalkeeping_kicking" "goalkeeping_positioning"
## [77] "goalkeeping_reflexes" "goalkeeping_speed"
## [79] "ls" "st"
## [81] "rs" "lw"
## [83] "lf" "cf"
## [85] "rf" "rw"
## [87] "lam" "cam"
## [89] "ram" "lm"
## [91] "lcm" "cm"
## [93] "rcm" "rm"
## [95] "lwb" "ldm"
## [97] "cdm" "rdm"
## [99] "rwb" "lb"
## [101] "lcb" "cb"
## [103] "rcb" "rb"
## [105] "gk" "player_face_url"
## [107] "club_logo_url" "club_flag_url"
## [109] "nation_logo_url" "nation_flag_url"
To get a better general idea, we also look at the type of data these attributes hold:
head(players_full, 10)
## sofifa_id player_url
## 1 158023 https://sofifa.com/player/158023/lionel-messi/220002
## 2 188545 https://sofifa.com/player/188545/robert-lewandowski/220002
## 3 20801 https://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/220002
## 4 190871 https://sofifa.com/player/190871/neymar-da-silva-santos-jr/220002
## 5 192985 https://sofifa.com/player/192985/kevin-de-bruyne/220002
## 6 200389 https://sofifa.com/player/200389/jan-oblak/220002
## 7 231747 https://sofifa.com/player/231747/kylian-mbappe/220002
## 8 167495 https://sofifa.com/player/167495/manuel-neuer/220002
## 9 192448 https://sofifa.com/player/192448/marc-andre-ter-stegen/220002
## 10 202126 https://sofifa.com/player/202126/harry-kane/220002
## short_name long_name player_positions
## 1 L. Messi Lionel Andrés Messi Cuccittini RW, ST, CF
## 2 R. Lewandowski Robert Lewandowski ST
## 3 Cristiano Ronaldo Cristiano Ronaldo dos Santos Aveiro ST, LW
## 4 Neymar Jr Neymar da Silva Santos Júnior LW, CAM
## 5 K. De Bruyne Kevin De Bruyne CM, CAM
## 6 J. Oblak Jan Oblak GK
## 7 K. Mbappé Kylian Mbappé Lottin ST, LW
## 8 M. Neuer Manuel Peter Neuer GK
## 9 M. ter Stegen Marc-André ter Stegen GK
## 10 H. Kane Harry Kane ST
## overall potential value_eur wage_eur age dob height_cm weight_kg
## 1 93 93 78000000 320000 34 1987-06-24 170 72
## 2 92 92 119500000 270000 32 1988-08-21 185 81
## 3 91 91 45000000 270000 36 1985-02-05 187 83
## 4 91 91 129000000 270000 29 1992-02-05 175 68
## 5 91 91 125500000 350000 30 1991-06-28 181 70
## 6 91 93 112000000 130000 28 1993-01-07 188 87
## 7 91 95 194000000 230000 22 1998-12-20 182 73
## 8 90 90 13500000 86000 35 1986-03-27 193 93
## 9 90 92 99000000 250000 29 1992-04-30 187 85
## 10 90 90 129500000 240000 27 1993-07-28 188 89
## club_team_id club_name league_name league_level
## 1 73 Paris Saint-Germain French Ligue 1 1
## 2 21 FC Bayern München German 1. Bundesliga 1
## 3 11 Manchester United English Premier League 1
## 4 73 Paris Saint-Germain French Ligue 1 1
## 5 10 Manchester City English Premier League 1
## 6 240 Atlético de Madrid Spain Primera Division 1
## 7 73 Paris Saint-Germain French Ligue 1 1
## 8 21 FC Bayern München German 1. Bundesliga 1
## 9 241 FC Barcelona Spain Primera Division 1
## 10 18 Tottenham Hotspur English Premier League 1
## club_position club_jersey_number club_loaned_from club_joined
## 1 RW 30 2021-08-10
## 2 ST 9 2014-07-01
## 3 ST 7 2021-08-27
## 4 LW 10 2017-08-03
## 5 RCM 17 2015-08-30
## 6 GK 13 2014-07-16
## 7 ST 7 2018-07-01
## 8 GK 1 2011-07-01
## 9 GK 1 2014-07-01
## 10 ST 10 2010-07-28
## club_contract_valid_until nationality_id nationality_name nation_team_id
## 1 2023 52 Argentina 1369
## 2 2023 37 Poland 1353
## 3 2023 38 Portugal 1354
## 4 2025 54 Brazil NA
## 5 2025 7 Belgium 1325
## 6 2023 44 Slovenia NA
## 7 2022 18 France 1335
## 8 2023 21 Germany 1337
## 9 2025 21 Germany NA
## 10 2024 14 England 1318
## nation_position nation_jersey_number preferred_foot weak_foot skill_moves
## 1 RW 10 Left 4 4
## 2 RS 9 Right 4 4
## 3 ST 7 Right 4 5
## 4 NA Right 5 5
## 5 RCM 7 Right 5 4
## 6 NA Right 3 1
## 7 LW 10 Right 4 5
## 8 GK 1 Right 4 1
## 9 NA Right 4 1
## 10 ST 9 Right 5 3
## international_reputation work_rate body_type real_face
## 1 5 Medium/Low Unique Yes
## 2 5 High/Medium Unique Yes
## 3 5 High/Low Unique Yes
## 4 5 High/Medium Unique Yes
## 5 4 High/High Unique Yes
## 6 5 Medium/Medium Unique Yes
## 7 4 High/Low Unique Yes
## 8 5 Medium/Medium Unique Yes
## 9 4 Medium/Medium Unique Yes
## 10 4 High/High Unique Yes
## release_clause_eur
## 1 144300000
## 2 197200000
## 3 83300000
## 4 238700000
## 5 232200000
## 6 238000000
## 7 373500000
## 8 22300000
## 9 210400000
## 10 246100000
## player_tags
## 1 #Dribbler, #Distance Shooter, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Forward
## 2 #Aerial Threat, #Distance Shooter, #Clinical Finisher, #Complete Forward
## 3 #Aerial Threat, #Dribbler, #Distance Shooter, #Crosser, #Acrobat, #Clinical Finisher, #Complete Forward
## 4 #Speedster, #Dribbler, #Playmaker, #FK Specialist, #Acrobat, #Complete Midfielder
## 5 #Dribbler, #Playmaker, #Engine, #Distance Shooter, #Crosser, #Complete Midfielder
## 6
## 7 #Speedster, #Dribbler, #Acrobat, #Clinical Finisher, #Complete Forward
## 8
## 9
## 10 #Distance Shooter, #Clinical Finisher
## player_traits
## 1 Finesse Shot, Long Shot Taker (AI), Playmaker (AI), Outside Foot Shot, One Club Player, Chip Shot (AI), Technical Dribbler (AI)
## 2 Solid Player, Finesse Shot, Outside Foot Shot, Chip Shot (AI)
## 3 Power Free-Kick, Flair, Long Shot Taker (AI), Speed Dribbler (AI), Outside Foot Shot
## 4 Injury Prone, Flair, Speed Dribbler (AI), Playmaker (AI), Outside Foot Shot, Technical Dribbler (AI)
## 5 Injury Prone, Leadership, Early Crosser, Long Passer (AI), Long Shot Taker (AI), Playmaker (AI), Outside Foot Shot
## 6 GK Long Throw, Comes For Crosses
## 7 Flair, Speed Dribbler (AI), Outside Foot Shot, Technical Dribbler (AI)
## 8 Leadership, GK Long Throw, Rushes Out Of Goal, Comes For Crosses
## 9 Rushes Out Of Goal, Comes For Crosses, Saves with Feet
## 10 Leadership, Long Passer (AI), Long Shot Taker (AI), Playmaker (AI), Outside Foot Shot
## pace shooting passing dribbling defending physic attacking_crossing
## 1 85 92 91 95 34 65 85
## 2 78 92 79 86 44 82 71
## 3 87 94 80 88 34 75 87
## 4 91 83 86 94 37 63 85
## 5 76 86 93 88 64 78 94
## 6 NA NA NA NA NA NA 13
## 7 97 88 80 92 36 77 78
## 8 NA NA NA NA NA NA 15
## 9 NA NA NA NA NA NA 18
## 10 70 91 83 83 47 83 80
## attacking_finishing attacking_heading_accuracy attacking_short_passing
## 1 95 70 91
## 2 95 90 85
## 3 95 90 80
## 4 83 63 86
## 5 82 55 94
## 6 11 15 43
## 7 93 72 85
## 8 13 25 60
## 9 14 11 61
## 10 94 86 85
## attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## 1 88 96 93 94
## 2 89 85 79 85
## 3 86 88 81 84
## 4 86 95 88 87
## 5 82 88 85 83
## 6 13 12 13 14
## 7 83 93 80 69
## 8 11 30 14 11
## 9 14 21 18 12
## 10 88 83 83 65
## skill_long_passing skill_ball_control movement_acceleration
## 1 91 96 91
## 2 70 88 77
## 3 77 88 85
## 4 81 95 93
## 5 93 91 76
## 6 40 30 43
## 7 71 91 97
## 8 68 46 54
## 9 63 30 38
## 10 86 85 65
## movement_sprint_speed movement_agility movement_reactions movement_balance
## 1 80 91 94 95
## 2 79 77 93 82
## 3 88 86 94 74
## 4 89 96 89 84
## 5 76 79 91 78
## 6 60 67 88 49
## 7 97 92 93 83
## 8 60 51 87 35
## 9 50 39 86 43
## 10 74 71 92 70
## power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1 86 68 72 69 94
## 2 90 85 76 86 87
## 3 94 95 77 77 93
## 4 80 64 81 53 81
## 5 91 63 89 74 91
## 6 59 78 41 78 12
## 7 86 78 88 77 82
## 8 68 77 43 80 16
## 9 66 79 35 78 10
## 10 91 79 83 85 86
## mentality_aggression mentality_interceptions mentality_positioning
## 1 44 40 93
## 2 81 49 95
## 3 63 29 95
## 4 63 37 86
## 5 76 66 88
## 6 34 19 11
## 7 62 38 92
## 8 29 30 12
## 9 43 22 11
## 10 80 44 94
## mentality_vision mentality_penalties mentality_composure
## 1 95 75 96
## 2 81 90 88
## 3 76 88 95
## 4 90 93 93
## 5 94 83 89
## 6 65 11 68
## 7 82 79 88
## 8 70 47 70
## 9 70 25 70
## 10 87 91 91
## defending_marking_awareness defending_standing_tackle
## 1 20 35
## 2 35 42
## 3 24 32
## 4 35 32
## 5 68 65
## 6 27 12
## 7 26 34
## 8 17 10
## 9 25 13
## 10 50 36
## defending_sliding_tackle goalkeeping_diving goalkeeping_handling
## 1 24 6 11
## 2 19 15 6
## 3 24 7 11
## 4 29 9 9
## 5 53 15 13
## 6 18 87 92
## 7 32 13 5
## 8 11 88 88
## 9 10 88 85
## 10 38 8 10
## goalkeeping_kicking goalkeeping_positioning goalkeeping_reflexes
## 1 15 14 8
## 2 12 8 10
## 3 15 14 11
## 4 15 15 11
## 5 5 10 13
## 6 78 90 90
## 7 7 11 6
## 8 91 89 88
## 9 88 88 90
## 10 11 14 11
## goalkeeping_speed ls st rs lw lf cf rf rw lam cam ram lm lcm
## 1 NA 89+3 89+3 89+3 92 93 93 93 92 93 93 93 91+2 87+3
## 2 NA 90+2 90+2 90+2 85 88 88 88 85 86+3 86+3 86+3 84+3 80+3
## 3 NA 90+1 90+1 90+1 88 89 89 89 88 86+3 86+3 86+3 86+3 78+3
## 4 NA 83+3 83+3 83+3 90 88 88 88 90 89+2 89+2 89+2 89+2 82+3
## 5 NA 83+3 83+3 83+3 88 87 87 87 88 89+2 89+2 89+2 89+2 89+2
## 6 50 33+3 33+3 33+3 32 35 35 35 32 38+3 38+3 38+3 35+3 38+3
## 7 NA 89+3 89+3 89+3 90 90 90 90 90 89+3 89+3 89+3 89+3 81+3
## 8 56 40+3 40+3 40+3 40 43 43 43 40 47+3 47+3 47+3 44+3 50+3
## 9 43 35+3 35+3 35+3 35 38 38 38 35 42+3 42+3 42+3 39+3 45+3
## 10 NA 88+2 88+2 88+2 84 86 86 86 84 85+3 85+3 85+3 84+3 82+3
## cm rcm rm lwb ldm cdm rdm rwb lb lcb cb rcb rb gk
## 1 87+3 87+3 91+2 66+3 64+3 64+3 64+3 66+3 61+3 50+3 50+3 50+3 61+3 19+3
## 2 80+3 80+3 84+3 64+3 66+3 66+3 66+3 64+3 61+3 60+3 60+3 60+3 61+3 19+3
## 3 78+3 78+3 86+3 63+3 59+3 59+3 59+3 63+3 60+3 53+3 53+3 53+3 60+3 20+3
## 4 82+3 82+3 89+2 67+3 63+3 63+3 63+3 67+3 62+3 50+3 50+3 50+3 62+3 20+3
## 5 89+2 89+2 89+2 79+3 80+3 80+3 80+3 79+3 75+3 69+3 69+3 69+3 75+3 21+3
## 6 38+3 38+3 35+3 32+3 36+3 36+3 36+3 32+3 32+3 33+3 33+3 33+3 32+3 89+3
## 7 81+3 81+3 89+3 67+3 63+3 63+3 63+3 67+3 63+3 54+3 54+3 54+3 63+3 18+3
## 8 50+3 50+3 44+3 37+3 43+3 43+3 43+3 37+3 35+3 34+3 34+3 34+3 35+3 88+2
## 9 45+3 45+3 39+3 33+3 41+3 41+3 41+3 33+3 31+3 33+3 33+3 33+3 31+3 88+3
## 10 82+3 82+3 84+3 67+3 68+3 68+3 68+3 67+3 64+3 61+3 61+3 61+3 64+3 20+3
## player_face_url
## 1 https://cdn.sofifa.net/players/158/023/22_120.png
## 2 https://cdn.sofifa.net/players/188/545/22_120.png
## 3 https://cdn.sofifa.net/players/020/801/22_120.png
## 4 https://cdn.sofifa.net/players/190/871/22_120.png
## 5 https://cdn.sofifa.net/players/192/985/22_120.png
## 6 https://cdn.sofifa.net/players/200/389/22_120.png
## 7 https://cdn.sofifa.net/players/231/747/22_120.png
## 8 https://cdn.sofifa.net/players/167/495/22_120.png
## 9 https://cdn.sofifa.net/players/192/448/22_120.png
## 10 https://cdn.sofifa.net/players/202/126/22_120.png
## club_logo_url
## 1 https://cdn.sofifa.net/teams/73/60.png
## 2 https://cdn.sofifa.net/teams/21/60.png
## 3 https://cdn.sofifa.net/teams/11/60.png
## 4 https://cdn.sofifa.net/teams/73/60.png
## 5 https://cdn.sofifa.net/teams/10/60.png
## 6 https://cdn.sofifa.net/teams/240/60.png
## 7 https://cdn.sofifa.net/teams/73/60.png
## 8 https://cdn.sofifa.net/teams/21/60.png
## 9 https://cdn.sofifa.net/teams/241/60.png
## 10 https://cdn.sofifa.net/teams/18/60.png
## club_flag_url
## 1 https://cdn.sofifa.net/flags/fr.png
## 2 https://cdn.sofifa.net/flags/de.png
## 3 https://cdn.sofifa.net/flags/gb-eng.png
## 4 https://cdn.sofifa.net/flags/fr.png
## 5 https://cdn.sofifa.net/flags/gb-eng.png
## 6 https://cdn.sofifa.net/flags/es.png
## 7 https://cdn.sofifa.net/flags/fr.png
## 8 https://cdn.sofifa.net/flags/de.png
## 9 https://cdn.sofifa.net/flags/es.png
## 10 https://cdn.sofifa.net/flags/gb-eng.png
## nation_logo_url
## 1 https://cdn.sofifa.net/teams/1369/60.png
## 2 https://cdn.sofifa.net/teams/1353/60.png
## 3 https://cdn.sofifa.net/teams/1354/60.png
## 4
## 5 https://cdn.sofifa.net/teams/1325/60.png
## 6
## 7 https://cdn.sofifa.net/teams/1335/60.png
## 8 https://cdn.sofifa.net/teams/1337/60.png
## 9
## 10 https://cdn.sofifa.net/teams/1318/60.png
## nation_flag_url
## 1 https://cdn.sofifa.net/flags/ar.png
## 2 https://cdn.sofifa.net/flags/pl.png
## 3 https://cdn.sofifa.net/flags/pt.png
## 4 https://cdn.sofifa.net/flags/br.png
## 5 https://cdn.sofifa.net/flags/be.png
## 6 https://cdn.sofifa.net/flags/si.png
## 7 https://cdn.sofifa.net/flags/fr.png
## 8 https://cdn.sofifa.net/flags/de.png
## 9 https://cdn.sofifa.net/flags/de.png
## 10 https://cdn.sofifa.net/flags/gb-eng.png
We start with a rough removal of the features that are clearly irrelevant to our classification (identifiers, URLs, club and contract details), and of some that are an obvious linear combination of other features. Moreover, we restrict the data to players whose league_level is 1, i.e. those playing in first-tier leagues. Then we check the dimensions again.
players_full <- players_full[players_full$league_level == 1,]
players_22 <- subset(players_full, select = c("short_name","player_positions","age","height_cm","weight_kg","pace","shooting","passing","preferred_foot","weak_foot","dribbling","defending","physic","attacking_crossing","attacking_finishing","attacking_heading_accuracy","attacking_short_passing","attacking_volleys","skill_dribbling","skill_curve","skill_fk_accuracy","skill_long_passing","skill_ball_control","movement_acceleration","movement_sprint_speed","movement_agility","movement_reactions","movement_balance","power_shot_power","power_jumping","power_stamina","power_strength","power_long_shots","mentality_aggression","mentality_interceptions","mentality_positioning","mentality_vision","mentality_penalties","mentality_composure","defending_marking_awareness","defending_standing_tackle","defending_sliding_tackle"))
dim(players_22)
## [1] 14918 42
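One detail of the filter above is worth a small illustration. In R, indexing rows with a logical vector that contains NA yields a row filled with NA for each NA comparison, which may be where the 61 rows that are NA in every column of the later summary come from. The sketch below uses a made-up toy data frame (the names and values are hypothetical, not taken from the dataset) and shows how wrapping the condition in which() avoids this.

```r
# Toy data frame: one player has a missing league_level (hypothetical values).
toy <- data.frame(
  short_name   = c("A", "B", "C", "D"),
  league_level = c(1, 2, NA, 1),
  pace         = c(80, 70, 65, 90)
)

# Plain logical indexing: the NA comparison produces a row filled with NA.
naive <- toy[toy$league_level == 1, ]

# which() keeps only the positions where the condition is TRUE.
safe <- toy[which(toy$league_level == 1), ]

nrow(naive)  # 3: two real matches plus one all-NA row
nrow(safe)   # 2: only the real matches
```

In this report the difference is harmless because incomplete rows are dropped later anyway, but the row count printed right after the filter is affected.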
We kept 42 columns: the player name, the position label and 40 candidate features. Good enough for now; we will remove more of them later through feature selection.
head(players_22, n=5)
## short_name player_positions age height_cm weight_kg pace shooting
## 1 L. Messi RW, ST, CF 34 170 72 85 92
## 2 R. Lewandowski ST 32 185 81 78 92
## 3 Cristiano Ronaldo ST, LW 36 187 83 87 94
## 4 Neymar Jr LW, CAM 29 175 68 91 83
## 5 K. De Bruyne CM, CAM 30 181 70 76 86
## passing preferred_foot weak_foot dribbling defending physic
## 1 91 Left 4 95 34 65
## 2 79 Right 4 86 44 82
## 3 80 Right 4 88 34 75
## 4 86 Right 5 94 37 63
## 5 93 Right 5 88 64 78
## attacking_crossing attacking_finishing attacking_heading_accuracy
## 1 85 95 70
## 2 71 95 90
## 3 87 95 90
## 4 85 83 63
## 5 94 82 55
## attacking_short_passing attacking_volleys skill_dribbling skill_curve
## 1 91 88 96 93
## 2 85 89 85 79
## 3 80 86 88 81
## 4 86 86 95 88
## 5 94 82 88 85
## skill_fk_accuracy skill_long_passing skill_ball_control movement_acceleration
## 1 94 91 96 91
## 2 85 70 88 77
## 3 84 77 88 85
## 4 87 81 95 93
## 5 83 93 91 76
## movement_sprint_speed movement_agility movement_reactions movement_balance
## 1 80 91 94 95
## 2 79 77 93 82
## 3 88 86 94 74
## 4 89 96 89 84
## 5 76 79 91 78
## power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1 86 68 72 69 94
## 2 90 85 76 86 87
## 3 94 95 77 77 93
## 4 80 64 81 53 81
## 5 91 63 89 74 91
## mentality_aggression mentality_interceptions mentality_positioning
## 1 44 40 93
## 2 81 49 95
## 3 63 29 95
## 4 63 37 86
## 5 76 66 88
## mentality_vision mentality_penalties mentality_composure
## 1 95 75 96
## 2 81 90 88
## 3 76 88 95
## 4 90 93 93
## 5 94 83 89
## defending_marking_awareness defending_standing_tackle
## 1 20 35
## 2 35 42
## 3 24 32
## 4 35 32
## 5 68 65
## defending_sliding_tackle
## 1 24
## 2 19
## 3 24
## 4 29
## 5 53
We take a quick look at a numerical summary of all the features we selected. At first glance they need some normalization, but before that we would like to produce a few visualizations.
summary(players_22)
## short_name player_positions age height_cm
## Length:14918 Length:14918 Min. :16.00 Min. :155
## Class :character Class :character 1st Qu.:21.00 1st Qu.:176
## Mode :character Mode :character Median :25.00 Median :181
## Mean :25.34 Mean :181
## 3rd Qu.:29.00 3rd Qu.:186
## Max. :54.00 Max. :203
## NA's :61 NA's :61
## weight_kg pace shooting passing
## Min. : 49.00 Min. :28.00 Min. :18.0 Min. :25.00
## 1st Qu.: 70.00 1st Qu.:62.00 1st Qu.:42.0 1st Qu.:51.00
## Median : 75.00 Median :69.00 Median :55.0 Median :58.00
## Mean : 74.84 Mean :68.33 Mean :52.8 Mean :57.88
## 3rd Qu.: 80.00 3rd Qu.:76.00 3rd Qu.:64.0 3rd Qu.:65.00
## Max. :107.00 Max. :97.00 Max. :94.0 Max. :93.00
## NA's :61 NA's :1725 NA's :1725 NA's :1725
## preferred_foot weak_foot dribbling defending
## Length:14918 Min. :1.000 Min. :27.00 Min. :15.00
## Class :character 1st Qu.:3.000 1st Qu.:57.00 1st Qu.:38.00
## Mode :character Median :3.000 Median :64.00 Median :56.00
## Mean :2.948 Mean :62.99 Mean :52.03
## 3rd Qu.:3.000 3rd Qu.:70.00 3rd Qu.:65.00
## Max. :5.000 Max. :95.00 Max. :91.00
## NA's :61 NA's :1725 NA's :1725
## physic attacking_crossing attacking_finishing
## Min. :29.00 Min. : 6 Min. : 2.0
## 1st Qu.:59.00 1st Qu.:39 1st Qu.:31.0
## Median :66.00 Median :54 Median :50.0
## Mean :64.89 Mean :50 Mean :46.2
## 3rd Qu.:72.00 3rd Qu.:64 3rd Qu.:62.0
## Max. :90.00 Max. :94 Max. :95.0
## NA's :1725 NA's :61 NA's :61
## attacking_heading_accuracy attacking_short_passing attacking_volleys
## Min. : 5.00 Min. : 7.00 Min. : 3.00
## 1st Qu.:44.00 1st Qu.:55.00 1st Qu.:30.00
## Median :55.00 Median :63.00 Median :44.00
## Mean :51.95 Mean :59.33 Mean :42.89
## 3rd Qu.:64.00 3rd Qu.:69.00 3rd Qu.:57.00
## Max. :93.00 Max. :94.00 Max. :90.00
## NA's :61 NA's :61 NA's :61
## skill_dribbling skill_curve skill_fk_accuracy skill_long_passing
## Min. : 4 Min. : 6.00 Min. : 4.00 Min. : 9.00
## 1st Qu.:50 1st Qu.:35.00 1st Qu.:31.00 1st Qu.:45.00
## Median :62 Median :49.00 Median :41.00 Median :57.00
## Mean :56 Mean :47.73 Mean :42.65 Mean :53.63
## 3rd Qu.:69 3rd Qu.:62.00 3rd Qu.:56.00 3rd Qu.:65.00
## Max. :96 Max. :94.00 Max. :94.00 Max. :93.00
## NA's :61 NA's :61 NA's :61 NA's :61
## skill_ball_control movement_acceleration movement_sprint_speed
## Min. : 8.00 Min. :14.0 Min. :15.00
## 1st Qu.:55.00 1st Qu.:58.0 1st Qu.:58.00
## Median :63.00 Median :68.0 Median :68.00
## Mean :58.88 Mean :64.7 Mean :64.77
## 3rd Qu.:70.00 3rd Qu.:75.0 3rd Qu.:75.00
## Max. :96.00 Max. :97.0 Max. :97.00
## NA's :61 NA's :61 NA's :61
## movement_agility movement_reactions movement_balance power_shot_power
## Min. :18.00 Min. :25.00 Min. :19.0 Min. :20.00
## 1st Qu.:55.00 1st Qu.:56.00 1st Qu.:56.0 1st Qu.:48.00
## Median :66.00 Median :62.00 Median :66.0 Median :59.00
## Mean :63.55 Mean :61.91 Mean :64.1 Mean :58.19
## 3rd Qu.:74.00 3rd Qu.:68.00 3rd Qu.:74.0 3rd Qu.:68.00
## Max. :96.00 Max. :94.00 Max. :96.0 Max. :95.00
## NA's :61 NA's :61 NA's :61 NA's :61
## power_jumping power_stamina power_strength power_long_shots
## Min. :24.00 Min. :12.00 Min. :19.00 Min. : 4.00
## 1st Qu.:57.00 1st Qu.:56.00 1st Qu.:57.00 1st Qu.:32.00
## Median :65.00 Median :67.00 Median :66.00 Median :51.00
## Mean :64.75 Mean :63.15 Mean :64.97 Mean :47.08
## 3rd Qu.:73.00 3rd Qu.:74.00 3rd Qu.:74.00 3rd Qu.:63.00
## Max. :95.00 Max. :97.00 Max. :96.00 Max. :94.00
## NA's :61 NA's :61 NA's :61 NA's :61
## mentality_aggression mentality_interceptions mentality_positioning
## Min. :10.00 Min. : 4.00 Min. : 2.00
## 1st Qu.:45.00 1st Qu.:26.00 1st Qu.:40.00
## Median :59.00 Median :53.00 Median :56.00
## Mean :55.85 Mean :46.95 Mean :50.76
## 3rd Qu.:69.00 3rd Qu.:64.00 3rd Qu.:65.00
## Max. :95.00 Max. :91.00 Max. :96.00
## NA's :61 NA's :61 NA's :61
## mentality_vision mentality_penalties mentality_composure
## Min. :10.00 Min. : 7.00 Min. :12.0
## 1st Qu.:45.00 1st Qu.:38.00 1st Qu.:50.0
## Median :56.00 Median :49.00 Median :59.0
## Mean :54.49 Mean :48.11 Mean :58.4
## 3rd Qu.:65.00 3rd Qu.:60.00 3rd Qu.:67.0
## Max. :95.00 Max. :93.00 Max. :96.0
## NA's :61 NA's :61 NA's :61
## defending_marking_awareness defending_standing_tackle defending_sliding_tackle
## Min. : 4.00 Min. : 5.00 Min. : 5.00
## 1st Qu.:29.00 1st Qu.:28.00 1st Qu.:26.00
## Median :52.00 Median :55.00 Median :53.00
## Mean :46.86 Mean :48.28 Mean :46.12
## 3rd Qu.:64.00 3rd Qu.:66.00 3rd Qu.:64.00
## Max. :93.00 Max. :93.00 Max. :92.00
## NA's :61 NA's :61 NA's :61
2.2 Managing empty entries
We look at which attributes contain NAs, in order to decide whether to remove the affected rows or to fill them in.
which(apply(X = players_22, MARGIN = 2, FUN = anyNA) == TRUE) # check for NA
## short_name player_positions
## 1 2
## age height_cm
## 3 4
## weight_kg pace
## 5 6
## shooting passing
## 7 8
## preferred_foot weak_foot
## 9 10
## dribbling defending
## 11 12
## physic attacking_crossing
## 13 14
## attacking_finishing attacking_heading_accuracy
## 15 16
## attacking_short_passing attacking_volleys
## 17 18
## skill_dribbling skill_curve
## 19 20
## skill_fk_accuracy skill_long_passing
## 21 22
## skill_ball_control movement_acceleration
## 23 24
## movement_sprint_speed movement_agility
## 25 26
## movement_reactions movement_balance
## 27 28
## power_shot_power power_jumping
## 29 30
## power_stamina power_strength
## 31 32
## power_long_shots mentality_aggression
## 33 34
## mentality_interceptions mentality_positioning
## 35 36
## mentality_vision mentality_penalties
## 37 38
## mentality_composure defending_marking_awareness
## 39 40
## defending_standing_tackle defending_sliding_tackle
## 41 42
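The call above only reports which columns contain at least one NA. To decide between removing and imputing, the number of NAs per column and the share of incomplete rows are usually more informative. A minimal sketch on a made-up toy data frame (hypothetical values) could look like this:

```r
# Toy data frame with a few missing values (hypothetical values).
toy <- data.frame(
  age       = c(34, NA, 29),
  pace      = c(85, NA, NA),
  height_cm = c(170, 185, 175)
)

# Number of NAs in each column.
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]

# Share of rows that na.omit() would drop.
incomplete_share <- mean(!complete.cases(toy))
incomplete_share
```

Applied to players_22, the same two expressions would give the per-column counts already visible in the summary above and the fraction of rows lost by dropping incomplete cases.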
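Dropping every row with at least one NA removes 1,725 of the 14,918 rows (about 11.6%): these are the rows without pace, shooting, passing, dribbling, defending and physic values, and they include the 61 rows in which age, height and the other attributes are missing as well. We consider this loss acceptable, so we remove them.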
players_22 <- na.omit(players_22) # delete NA
dim(players_22)
## [1] 13193 42
We still have a good chunk of the dataset left. Since goalkeepers have special stats, we would also like to take them out. First, we check how many remain.
goalkeepers <- str_detect(players_22$player_positions, "GK")
sum(goalkeepers)
## [1] 0
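The count is 0: goalkeepers have no pace, shooting or other composite outfield ratings (see the NA entries in the preview above), so na.omit() has already removed them. Goalkeepers are indispensable on the field, but their data would only reduce the accuracy of the classification of the other positions, so the filter below is kept as a safeguard.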
players_22<-subset(players_22, player_positions!="GK")
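To make the zero count easier to follow, here is a small self-contained sketch on a made-up toy data frame (hypothetical values, with base grepl() standing in for str_detect()). It counts goalkeepers before and after na.omit() when their pace is missing.

```r
# Toy data: goalkeepers have no pace rating (hypothetical values).
toy <- data.frame(
  player_positions   = c("RW, ST, CF", "GK", "ST", "GK"),
  pace               = c(85, NA, 78, NA),
  attacking_crossing = c(85, 13, 71, 15),
  stringsAsFactors   = FALSE
)

# Goalkeepers before removing incomplete rows.
gk_before <- sum(grepl("GK", toy$player_positions, fixed = TRUE))

# na.omit() drops every row with at least one NA, i.e. both goalkeepers.
toy_complete <- na.omit(toy)
gk_after <- sum(grepl("GK", toy_complete$player_positions, fixed = TRUE))

c(before = gk_before, after = gk_after)
```

Running the goalkeeper check before na.omit() would therefore be the way to see how many goalkeepers the dataset originally contained.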
2.3 Labelling
Some players play in multiple positions, but we only want to identify their main one, so we only keep that one. Moreover, we turn the binary “preferred_foot” feature into a numerical type.
#Keep only the main preferred position
players_22$player_positions<- word(players_22$player_positions, 1, sep = fixed(","))
unique(players_22$player_positions)
## [1] "RW" "ST" "LW" "CM" "CDM" "CF" "LM" "CB" "CAM" "LB" "RB" "RM"
## [13] "LWB" "RWB"
# Left foot is -1 and Right foot is 1. This is a plain two-level numeric encoding (not one-hot); it is enough because we only have 2 categories
players_22$preferred_foot[players_22[,"preferred_foot"]== "Left"] <- as.numeric(-1)
players_22$preferred_foot[players_22[,"preferred_foot"]== "Right"] <- as.numeric(1)
players_22$preferred_foot <- as.numeric(players_22$preferred_foot)
# now we group them into the main 9 positions
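The same encoding can also be written with a named lookup vector. This is only an alternative sketch, not the code used in this report; the vector foot_codes and the toy input are hypothetical. One possible advantage is that an unexpected label ends up as NA instead of being passed through as text.

```r
# Toy input (hypothetical values).
preferred_foot <- c("Left", "Right", "Right", "Left")

# Named lookup vector: label -> numeric code.
foot_codes <- c(Left = -1, Right = 1)

# Index the lookup by label; unname() drops the labels from the result.
encoded <- unname(foot_codes[preferred_foot])
encoded

# A label that is not in the lookup yields NA.
unknown <- unname(foot_codes["Both"])
unknown
```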
Now we take a look at the positions; we plan to group them by the area of the field in which they operate.
Goalkeeper excluded, the dataset lists 26 positions, namely: LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB and RB. Only 14 of them occur as a main position in our data.
Since 26 position labels are clearly too many, we cluster them into nine classes based on the area of action on the field.
Note: This is probably the only part where we applied our “domain knowledge”.
#central back
players_22$player_positions[players_22[,"player_positions"]== "LCB"|players_22[,"player_positions"]== "CB"|players_22[,"player_positions"]== "RCB"] <- "CB"
#left back
players_22$player_positions[players_22[,"player_positions"]== "LWB"|players_22[,"player_positions"]== "LB"]<-"LB"
#right back
players_22$player_positions[players_22[,"player_positions"]== "RWB"|players_22[,"player_positions"]== "RB"]<-"RB"
#central defensive midfielder
players_22$player_positions[players_22[,"player_positions"]== "LDM"|players_22[,"player_positions"]== "CDM"|players_22[,"player_positions"]== "RDM"] <- "CDM"
#central midfielder
players_22$player_positions[players_22[,"player_positions"]== "LCM"|players_22[,"player_positions"]== "CM"|players_22[,"player_positions"]== "RCM"] <- "CM"
#central attacking midfielder
players_22$player_positions[players_22[,"player_positions"]== "LAM"|players_22[,"player_positions"]== "CAM"|players_22[,"player_positions"]== "RAM"] <- "CAM"
#left winger
players_22$player_positions[players_22[,"player_positions"]== "LM"|players_22[,"player_positions"]== "LW"|players_22[,"player_positions"]== "LF"] <- "LW"
#right winger
players_22$player_positions[players_22[,"player_positions"]== "RM"|players_22[,"player_positions"]== "RW"|players_22[,"player_positions"]== "RF"] <- "RW"
#striker
players_22$player_positions[players_22[,"player_positions"]== "LS"|players_22[,"player_positions"]== "CF"|players_22[,"player_positions"]== "RS"] <- "ST"
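The grouping above can also be summarised as a single lookup table, which makes it easier to check that every one of the 26 positions has a class. The sketch below is an alternative formulation under that assumption, not the code that produced the results in this report; position_groups is a hypothetical name, and the input vector is the list of 14 main positions printed earlier.

```r
# Lookup table: detailed position -> one of the nine classes.
position_groups <- c(
  LCB = "CB",  CB  = "CB",  RCB = "CB",
  LWB = "LB",  LB  = "LB",
  RWB = "RB",  RB  = "RB",
  LDM = "CDM", CDM = "CDM", RDM = "CDM",
  LCM = "CM",  CM  = "CM",  RCM = "CM",
  LAM = "CAM", CAM = "CAM", RAM = "CAM",
  LM  = "LW",  LW  = "LW",  LF  = "LW",
  RM  = "RW",  RW  = "RW",  RF  = "RW",
  LS  = "ST",  CF  = "ST",  RS  = "ST", ST = "ST"
)

# The 14 main positions found in the data (from the unique() output above).
main_positions <- c("RW", "ST", "LW", "CM", "CDM", "CF", "LM",
                    "CB", "CAM", "LB", "RB", "RM", "LWB", "RWB")

# Map each position to its class; a position missing from the table gives NA.
grouped <- unname(position_groups[main_positions])
sort(unique(grouped))
```

With this form, anyNA(grouped) is a quick check that no position was left without a class.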
Let's take a look at the distribution of our labels.
position_counts <- table(factor(players_22$player_positions)) # renamed from cat, which shadows base::cat()
pie(position_counts,
col = hcl.colors(length(position_counts), "BluYl"))
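Slices of similar size are hard to compare in a pie chart, so a table of percentages can complement it when checking for class imbalance. A minimal sketch on made-up labels (hypothetical values, not the real distribution):

```r
# Toy labels (hypothetical values).
labels <- c("CB", "CB", "ST", "CM", "CB", "ST")

# Percentage of players in each class, rounded to one decimal.
shares <- round(100 * prop.table(table(labels)), 1)
shares
```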
Time to normalize the numerical values, as promised. For that, we implement a simple min-max re-scaling function, which maps each feature to the range [0, 1], and we apply it to all 40 feature columns of the dataframe.
# normalization function
normalize <-function(x) { (x -min(x))/(max(x)-min(x)) }
# normalize
players_norm <- as.data.frame(lapply(players_22[, c(3:42)], normalize))
head(players_norm,5)
## age height_cm weight_kg pace shooting passing preferred_foot
## 1 0.4736842 0.3125000 0.4423077 0.8260870 0.9736842 0.9705882 0
## 2 0.4210526 0.6250000 0.6153846 0.7246377 0.9736842 0.7941176 1
## 3 0.5263158 0.6666667 0.6538462 0.8550725 1.0000000 0.8088235 1
## 4 0.3421053 0.4166667 0.3653846 0.9130435 0.8552632 0.8970588 1
## 5 0.3684211 0.5416667 0.4038462 0.6956522 0.8947368 1.0000000 1
## weak_foot dribbling defending physic attacking_crossing
## 1 0.75 1.0000000 0.2500000 0.5901639 0.8860759
## 2 0.75 0.8676471 0.3815789 0.8688525 0.7088608
## 3 0.75 0.8970588 0.2500000 0.7540984 0.9113924
## 4 1.00 0.9852941 0.2894737 0.5573770 0.8860759
## 5 1.00 0.8970588 0.6447368 0.8032787 1.0000000
## attacking_finishing attacking_heading_accuracy attacking_short_passing
## 1 1.0000000 0.6973684 0.9577465
## 2 1.0000000 0.9605263 0.8732394
## 3 1.0000000 0.9605263 0.8028169
## 4 0.8588235 0.6052632 0.8873239
## 5 0.8470588 0.5000000 1.0000000
## attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## 1 0.9750 1.0000000 0.9878049 1.0000000
## 2 0.9875 0.8589744 0.8170732 0.8928571
## 3 0.9500 0.8974359 0.8414634 0.8809524
## 4 0.9500 0.9871795 0.9268293 0.9166667
## 5 0.9000 0.8974359 0.8902439 0.8690476
## skill_long_passing skill_ball_control movement_acceleration
## 1 0.9726027 1.0000000 0.9142857
## 2 0.6849315 0.8888889 0.7142857
## 3 0.7808219 0.8888889 0.8285714
## 4 0.8356164 0.9861111 0.9428571
## 5 1.0000000 0.9305556 0.7000000
## movement_sprint_speed movement_agility movement_reactions movement_balance
## 1 0.7571429 0.9275362 1.0000000 0.9857143
## 2 0.7428571 0.7246377 0.9846154 0.8000000
## 3 0.8714286 0.8550725 1.0000000 0.6857143
## 4 0.8857143 1.0000000 0.9230769 0.8285714
## 5 0.7000000 0.7536232 0.9538462 0.7428571
## power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1 0.8800000 0.5909091 0.6575342 0.6493506 1.0000000
## 2 0.9333333 0.8484848 0.7123288 0.8701299 0.9156627
## 3 0.9866667 1.0000000 0.7260274 0.7532468 0.9879518
## 4 0.8000000 0.5303030 0.7808219 0.4415584 0.8433735
## 5 0.9466667 0.5151515 0.8904110 0.7142857 0.9638554
## mentality_aggression mentality_interceptions mentality_positioning
## 1 0.3200000 0.3703704 0.9642857
## 2 0.8133333 0.4814815 0.9880952
## 3 0.5733333 0.2345679 0.9880952
## 4 0.5733333 0.3333333 0.8809524
## 5 0.7466667 0.6913580 0.9047619
## mentality_vision mentality_penalties mentality_composure
## 1 1.0000000 0.7750 1.0000000
## 2 0.8292683 0.9625 0.8787879
## 3 0.7682927 0.9375 0.9848485
## 4 0.9390244 1.0000 0.9545455
## 5 0.9878049 0.8750 0.8939394
## defending_marking_awareness defending_standing_tackle
## 1 0.1204819 0.3012048
## 2 0.3012048 0.3855422
## 3 0.1686747 0.2650602
## 4 0.3012048 0.2650602
## 5 0.6987952 0.6626506
## defending_sliding_tackle
## 1 0.1707317
## 2 0.1097561
## 3 0.1707317
## 4 0.2317073
## 5 0.5243902
2.3 Correlation matrix and feature selection
We compute the correlation matrix of the normalised features. With 40 features it is large and a bit hard to read, but corrplot can reorder the features by hierarchical clustering (order = 'hclust'), so that highly correlated features end up next to each other.
cormatrix <- cor(players_norm)
corrplot(cormatrix, method = 'shade', type = 'lower', order = 'hclust', title = "Correlation plot before feature selection")
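Now, in order to reduce the number of features, we flag the ones involved in highly correlated pairs. For every pair with an absolute correlation above 0.8, findCorrelation marks for removal the feature with the larger mean absolute correlation with all other features.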
highcorr <- findCorrelation(cormatrix, cutoff=0.8)
highcorr
## [1] 9 5 6 17 34 13 31 21 15 10 22 40 39 33 11 23
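To make the filtering idea concrete, here is a minimal base-R sketch on simulated data. It is a simplified illustration of the heuristic, not caret's exact implementation; the data frame toy and the helper drop_correlated are hypothetical names introduced only for this example.

```r
set.seed(1)  # assumed seed, only to make the toy data reproducible
x1 <- rnorm(200)
toy <- data.frame(
  a = x1,
  b = x1 + rnorm(200, sd = 0.1),  # nearly a copy of a
  c = rnorm(200),
  d = rnorm(200)
)

# Hypothetical helper: repeatedly take the most correlated pair and drop
# the member with the larger mean absolute correlation.
drop_correlated <- function(df, cutoff = 0.8) {
  cm <- abs(cor(df))
  diag(cm) <- 0
  dropped <- character(0)
  while (max(cm) > cutoff) {
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    mean_abs <- rowMeans(cm[pair, , drop = FALSE])
    victim <- rownames(cm)[pair[which.max(mean_abs)]]
    dropped <- c(dropped, victim)
    keep <- setdiff(rownames(cm), victim)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

dropped <- drop_correlated(toy)
kept <- setdiff(names(toy), dropped)
```

Only one of the two nearly identical columns is dropped; the independent columns c and d are kept.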
col2<-colnames(players_norm)
col2
## [1] "age" "height_cm"
## [3] "weight_kg" "pace"
## [5] "shooting" "passing"
## [7] "preferred_foot" "weak_foot"
## [9] "dribbling" "defending"
## [11] "physic" "attacking_crossing"
## [13] "attacking_finishing" "attacking_heading_accuracy"
## [15] "attacking_short_passing" "attacking_volleys"
## [17] "skill_dribbling" "skill_curve"
## [19] "skill_fk_accuracy" "skill_long_passing"
## [21] "skill_ball_control" "movement_acceleration"
## [23] "movement_sprint_speed" "movement_agility"
## [25] "movement_reactions" "movement_balance"
## [27] "power_shot_power" "power_jumping"
## [29] "power_stamina" "power_strength"
## [31] "power_long_shots" "mentality_aggression"
## [33] "mentality_interceptions" "mentality_positioning"
## [35] "mentality_vision" "mentality_penalties"
## [37] "mentality_composure" "defending_marking_awareness"
## [39] "defending_standing_tackle" "defending_sliding_tackle"
col2<-col2[-highcorr]
corrplot.mixed(cor(players_norm[highcorr]), lower = "number", upper="shade", tl.pos = 'lt')
Now we check whether we eliminated some of the dark spots from our correlation matrix.
corrplot(cor(players_norm[col2]), type = 'lower',method = 'shade', order = 'hclust', title = "Correlation plot after feature selection")
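# note: subset() without a select argument keeps every column, so players_model still holds all 40 features
players_model <- subset(players_norm)
# we can add the positions back
players_model$player_positions <- c(players_22$player_positions)
We did. The plot looks much cleaner, and the data is ready for further investigation.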
2.4 Individual feature investigation
We want to look at the individual distribution of each feature. We draw violin plots together with boxplots, five features at a time.
# violin plots to check the distributions
# note: par(mfrow) only affects base graphics, not ggplot2 plots
par(mfrow=c(4,2))
ggplot(data = melt(players_norm[,1:5]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
ggplot(data = melt(players_norm[,6:10]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
Weak foot is a discrete variable with values from 1 to 5. Preferred foot is +/-1, as discussed above. As in real life, a significantly larger proportion of players is right-footed.
ggplot(data = melt(players_norm[,11:15]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
ggplot(data = melt(players_norm[,16:20]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
ggplot(data = melt(players_norm[,21:25]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
ggplot(data = melt(players_norm[,26:30]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
ggplot(data = melt(players_norm[,31:35]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
ggplot(data = melt(players_norm[,36:40]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using as id variables
2.5 Principal Component Analysis
players.pca<-prcomp(players_norm,center=TRUE, scale.=TRUE)
summary(players.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.8338 2.9386 2.0527 1.53692 1.08836 1.00962 0.90009
## Proportion of Variance 0.3674 0.2159 0.1053 0.05905 0.02961 0.02548 0.02025
## Cumulative Proportion 0.3674 0.5833 0.6887 0.74772 0.77733 0.80281 0.82307
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.86204 0.78772 0.77920 0.70934 0.65328 0.6326 0.60847
## Proportion of Variance 0.01858 0.01551 0.01518 0.01258 0.01067 0.0100 0.00926
## Cumulative Proportion 0.84164 0.85716 0.87233 0.88491 0.89558 0.9056 0.91484
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.57770 0.54221 0.51429 0.49240 0.48928 0.4732 0.46759
## Proportion of Variance 0.00834 0.00735 0.00661 0.00606 0.00598 0.0056 0.00547
## Cumulative Proportion 0.92319 0.93054 0.93715 0.94321 0.94920 0.9548 0.96026
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.44503 0.43329 0.41573 0.39871 0.37169 0.35022 0.3462
## Proportion of Variance 0.00495 0.00469 0.00432 0.00397 0.00345 0.00307 0.0030
## Cumulative Proportion 0.96521 0.96990 0.97422 0.97820 0.98165 0.98472 0.9877
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.32974 0.31684 0.31578 0.2825 0.26595 0.17110 0.02476
## Proportion of Variance 0.00272 0.00251 0.00249 0.0020 0.00177 0.00073 0.00002
## Cumulative Proportion 0.99043 0.99294 0.99544 0.9974 0.99920 0.99993 0.99995
## PC36 PC37 PC38 PC39 PC40
## Standard deviation 0.02424 0.02309 0.02131 0.01725 0.01537
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion 0.99996 0.99998 0.99999 0.99999 1.00000
We obtain 40 components, one per feature. We visualise how much variance each of them explains with a scree plot.
fviz_eig(players.pca, addlabels = TRUE)
The first 5 components account for 77.7% of the total variance, and the first 2 for 58.3% of it. Now we want to see how our features project onto the main 2D factor plane.
fviz_pca_var(players.pca, labelsize = 2, alpha.var = 1.0, title = "Factor Plane for the FIFA 22 Data")
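The proportions quoted above can be recomputed by hand. Since the 40 features are scaled to unit variance, the total variance is 40, and each component explains its squared standard deviation divided by 40. A short sketch using the first five standard deviations printed in the summary (the variable names are chosen for this example only):

```r
# Standard deviations of PC1..PC5, copied from the summary output above
sdev_first5 <- c(3.8338, 2.9386, 2.0527, 1.53692, 1.08836)
n_features <- 40  # scaled features, so the total variance equals 40

prop_var <- sdev_first5^2 / n_features  # proportion of variance per component
cum_var <- cumsum(prop_var)             # cumulative proportion

round(prop_var, 4)
round(cum_var, 4)
```

For a fitted prcomp object the same quantities come from pca$sdev^2 / sum(pca$sdev^2).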
Now it's finally time to dive into the actual modelling process. We experiment with different classification algorithms and compare them.
3.1 Train-validation split
We split the data into a training set and a validation set, keeping the classical 70%-30% proportion.
## 70% of the sample size
smp_size <- floor(0.7 * nrow(players_model))
train_ind <- sample(seq_len(nrow(players_model)), size = smp_size)
train <- players_model[train_ind, ]
test <- players_model[-train_ind, ]
print('Train set size:')
## [1] "Train set size:"
print(dim(train))
## [1] 9235 41
print('Validation set size:')
## [1] "Validation set size:"
print(dim(test))
## [1] 3958 41
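No seed is set before sample() in the chunk above, so the split changes between runs. A minimal sketch of a reproducible version of the same 70%-30% split on a toy data frame; the seed value and the names toy, toy_train and toy_test are assumptions made for this example.

```r
set.seed(42)  # assumed seed; any fixed value makes the split repeatable
toy <- data.frame(
  x = rnorm(100),
  position = sample(c("CB", "CM", "ST"), 100, replace = TRUE)
)

smp_size <- floor(0.7 * nrow(toy))                        # 70% of the rows
train_ind <- sample(seq_len(nrow(toy)), size = smp_size)  # random row indices

toy_train <- toy[train_ind, ]
toy_test <- toy[-train_ind, ]   # all remaining rows

dim(toy_train)
dim(toy_test)
```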
We convert the labels to factors, so we can use them in our models.
#factorise labels
train_y <- as.factor(train[,41])
test_y <- as.factor(test[,41])
#remove labels from sets
train <- train[1:(length(train)-1)]
test <- test[1:(length(test)-1)]
As a sneak peek, this is roughly how the validation labels are distributed on the factor plane. We notice that the factor plane separates some positions quite well and others not.
test.pca<-prcomp(test,center=TRUE, scale.=TRUE)
fviz_pca_biplot(test.pca,
label = "all",
col.ind = test_y,
legend.title = "Players",
title = "Classification of players")
3.2 Useful functions
Before we train any model, we create a function that computes the accuracy from a confusion matrix, and one that selects the misclassified data so we can visualize it later on the factor plane.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
missclassified <- function(pred, label){
l<- pred
l[c(pred)==c(label)]<- 0
return (as.factor(l))
}
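Assigning 0 to a factor that has no level "0" produces NA together with a warning, so the function above marks correct predictions as NA. The sketch below shows the accuracy computation on a tiny example, plus a possible variant that works on character vectors instead; flag_errors and the label "correct" are hypothetical names, not part of the original analysis.

```r
# Accuracy in percent from a confusion matrix (rows: truth, columns: prediction)
accuracy <- function(x) sum(diag(x)) / sum(x) * 100

# Hypothetical variant of missclassified(): keeps the predicted label for
# errors and marks correct predictions with the string "correct"
flag_errors <- function(pred, label) {
  out <- as.character(pred)
  out[as.character(pred) == as.character(label)] <- "correct"
  factor(out)
}

lv <- c("CB", "CM", "ST")
truth <- factor(c("CB", "CB", "ST", "ST", "CM"), levels = lv)
pred <- factor(c("CB", "CM", "ST", "ST", "CM"), levels = lv)

acc <- accuracy(table(truth, pred))  # 4 of the 5 toy predictions are correct
flags <- flag_errors(pred, truth)
table(flags)
```

With this variant the correctly classified players stay in the plot as their own group instead of being removed as missing values.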
3.3 Knn
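## run the knn function
class <- factor(c(train_y))
# note: the label column was already removed in section 3.1, so these two lines drop one more column (the last feature, defending_sliding_tackle)
train <- train[1:(length(train)-1)]
test <- test[1:(length(test)-1)]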
accuracy_vect <- c()
ks<- c()
for(k1 in seq(5,100,5)) {
test_pred <-knn(train = train, test = test, cl = class, k = k1)
accuracy_vect <- append(accuracy_vect,accuracy(table(test_y,test_pred)))
ks <- append(ks, k1)
}
plot(ks, accuracy_vect, type = "p", col="blue", xlab="k", ylab="Accuracy (%)", main="Accuracy vs k")
We get the best k and its accuracy.
print('The best K in our case is:')
## [1] "The best K in our case is:"
print(ks[which.max(accuracy_vect)])
## [1] 25
print('And it gives us an accuracy of:' )
## [1] "And it gives us an accuracy of:"
print(accuracy_vect[which.max(accuracy_vect)])
## [1] 70.89439
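The loop above chooses k on the same set that is later used to report the accuracy. A minimal sketch of an alternative, assuming a three-way split (fit, validation, test) and using the built-in iris data; knn_predict is a hypothetical base-R helper written for this example, not the knn function used above.

```r
# Minimal k-NN classifier: Euclidean distance and majority vote
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # distance to every training row
    nearest <- order(d)[seq_len(k)]           # indices of the k closest rows
    names(which.max(table(train_y[nearest]))) # most frequent label among them
  })
}

set.seed(1)  # assumed seed
idx <- sample(seq_len(nrow(iris)))
fit_idx <- idx[1:90]      # used to fit the classifier
val_idx <- idx[91:120]    # used only to choose k
test_idx <- idx[121:150]  # held out for the final accuracy

ks <- seq(1, 15, 2)
val_acc <- sapply(ks, function(k) {
  p <- knn_predict(iris[fit_idx, 1:4], iris$Species[fit_idx], iris[val_idx, 1:4], k)
  mean(p == iris$Species[val_idx]) * 100
})
best_k <- ks[which.max(val_acc)]

final_pred <- knn_predict(iris[fit_idx, 1:4], iris$Species[fit_idx],
                          iris[test_idx, 1:4], best_k)
test_acc <- mean(final_pred == iris$Species[test_idx]) * 100
```

The accuracy on the held-out rows is then not influenced by the choice of k.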
test_pred <-knn(train = train, test = test, cl = class, k = 40)
df_pred=data.frame(test_y,test_pred)
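We generate a confusion matrix to check the mislabeled data. Note that the predictions above use k = 40, not the best k = 25 found by the search, so the accuracy behind this table (2778 of 3958 correct, about 70.2%) is slightly lower than the 70.9% reported above.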
#Evaluate the model performance
CrossTable(x=test_y, y=test_pred,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3958
##
##
## | test_pred
## test_y | CAM | CB | CDM | CM | LB | LW | RB | RW | ST | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CAM | 114 | 0 | 0 | 73 | 6 | 32 | 1 | 48 | 27 | 301 |
## | 0.379 | 0.000 | 0.000 | 0.243 | 0.020 | 0.106 | 0.003 | 0.159 | 0.090 | 0.076 |
## | 0.556 | 0.000 | 0.000 | 0.106 | 0.014 | 0.124 | 0.003 | 0.152 | 0.039 | |
## | 0.029 | 0.000 | 0.000 | 0.018 | 0.002 | 0.008 | 0.000 | 0.012 | 0.007 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CB | 0 | 646 | 26 | 4 | 29 | 0 | 20 | 0 | 0 | 725 |
## | 0.000 | 0.891 | 0.036 | 0.006 | 0.040 | 0.000 | 0.028 | 0.000 | 0.000 | 0.183 |
## | 0.000 | 0.909 | 0.088 | 0.006 | 0.065 | 0.000 | 0.057 | 0.000 | 0.000 | |
## | 0.000 | 0.163 | 0.007 | 0.001 | 0.007 | 0.000 | 0.005 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CDM | 0 | 36 | 210 | 114 | 17 | 0 | 14 | 0 | 0 | 391 |
## | 0.000 | 0.092 | 0.537 | 0.292 | 0.043 | 0.000 | 0.036 | 0.000 | 0.000 | 0.099 |
## | 0.000 | 0.051 | 0.709 | 0.165 | 0.038 | 0.000 | 0.040 | 0.000 | 0.000 | |
## | 0.000 | 0.009 | 0.053 | 0.029 | 0.004 | 0.000 | 0.004 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CM | 24 | 4 | 40 | 385 | 22 | 1 | 0 | 5 | 0 | 481 |
## | 0.050 | 0.008 | 0.083 | 0.800 | 0.046 | 0.002 | 0.000 | 0.010 | 0.000 | 0.122 |
## | 0.117 | 0.006 | 0.135 | 0.558 | 0.050 | 0.004 | 0.000 | 0.016 | 0.000 | |
## | 0.006 | 0.001 | 0.010 | 0.097 | 0.006 | 0.000 | 0.000 | 0.001 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LB | 0 | 11 | 1 | 9 | 331 | 0 | 15 | 1 | 0 | 368 |
## | 0.000 | 0.030 | 0.003 | 0.024 | 0.899 | 0.000 | 0.041 | 0.003 | 0.000 | 0.093 |
## | 0.000 | 0.015 | 0.003 | 0.013 | 0.747 | 0.000 | 0.043 | 0.003 | 0.000 | |
## | 0.000 | 0.003 | 0.000 | 0.002 | 0.084 | 0.000 | 0.004 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LW | 30 | 0 | 2 | 22 | 29 | 115 | 1 | 100 | 52 | 351 |
## | 0.085 | 0.000 | 0.006 | 0.063 | 0.083 | 0.328 | 0.003 | 0.285 | 0.148 | 0.089 |
## | 0.146 | 0.000 | 0.007 | 0.032 | 0.065 | 0.446 | 0.003 | 0.317 | 0.076 | |
## | 0.008 | 0.000 | 0.001 | 0.006 | 0.007 | 0.029 | 0.000 | 0.025 | 0.013 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RB | 0 | 14 | 16 | 31 | 2 | 0 | 287 | 1 | 0 | 351 |
## | 0.000 | 0.040 | 0.046 | 0.088 | 0.006 | 0.000 | 0.818 | 0.003 | 0.000 | 0.089 |
## | 0.000 | 0.020 | 0.054 | 0.045 | 0.005 | 0.000 | 0.815 | 0.003 | 0.000 | |
## | 0.000 | 0.004 | 0.004 | 0.008 | 0.001 | 0.000 | 0.073 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RW | 28 | 0 | 0 | 42 | 7 | 82 | 14 | 127 | 46 | 346 |
## | 0.081 | 0.000 | 0.000 | 0.121 | 0.020 | 0.237 | 0.040 | 0.367 | 0.133 | 0.087 |
## | 0.137 | 0.000 | 0.000 | 0.061 | 0.016 | 0.318 | 0.040 | 0.403 | 0.067 | |
## | 0.007 | 0.000 | 0.000 | 0.011 | 0.002 | 0.021 | 0.004 | 0.032 | 0.012 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## ST | 9 | 0 | 1 | 10 | 0 | 28 | 0 | 33 | 563 | 644 |
## | 0.014 | 0.000 | 0.002 | 0.016 | 0.000 | 0.043 | 0.000 | 0.051 | 0.874 | 0.163 |
## | 0.044 | 0.000 | 0.003 | 0.014 | 0.000 | 0.109 | 0.000 | 0.105 | 0.818 | |
## | 0.002 | 0.000 | 0.000 | 0.003 | 0.000 | 0.007 | 0.000 | 0.008 | 0.142 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 205 | 711 | 296 | 690 | 443 | 258 | 352 | 315 | 688 | 3958 |
## | 0.052 | 0.180 | 0.075 | 0.174 | 0.112 | 0.065 | 0.089 | 0.080 | 0.174 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
#creating confusion matrix
conf_mat <- confusion_matrix(targets = test_y,
predictions = test_pred)
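Now we visualize the misclassified players on the factor plane. Correctly classified players become NA inside missclassified(), which explains the warnings below, and are dropped from the plot.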
fviz_pca_biplot(test.pca,
label = "all",
col.ind = missclassified(test_pred,test_y),
legend.title = "Players",
title = "Classification of labeled/misslabeled players for KNN")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 2778 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
3.4 Random Forest
The hyperparameter we tune is mtry, the number of variables randomly sampled as candidates at each split. Changing the number of trees does not change much, and previous experimentation suggested that around 500 trees works well.
set.seed(123)
a <- c()
for (i in 5:10) {
model_RF <- randomForest(train_y ~ ., data = train, ntree = 500, mtry = i, importance = TRUE)
prediction_RF <- predict(model_RF, test, type = "class")
a[i-4] <- mean(prediction_RF == test_y) # fraction of correct predictions, i.e. accuracy on a 0-1 scale
}
plot(5:10,a)
a
## [1] 0.7304194 0.7314300 0.7306721 0.7337039 0.7299141 0.7304194
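mtry = 8 is the best one, with an accuracy of about 73.4%, although all six values lie within half a percentage point of each other.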
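The entries of a correspond to the mtry values 5 to 10, in that order. A short sketch that reads the best value off the printed accuracies (mtry_grid, best_mtry and spread_pp are names introduced for this example):

```r
# Accuracies copied from the output above, one per mtry value
a <- c(0.7304194, 0.7314300, 0.7306721, 0.7337039, 0.7299141, 0.7304194)
mtry_grid <- 5:10

best_mtry <- mtry_grid[which.max(a)]   # mtry with the highest accuracy
spread_pp <- (max(a) - min(a)) * 100   # best minus worst, in percentage points

best_mtry
round(spread_pp, 2)
```

The spread is small, so the choice of mtry in this range has little effect on the result.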
We refit the model with mtry = 8 and plot the misclassified labels on the factor plane again.
model_RF <- randomForest(train_y ~ ., data = train, ntree = 500, mtry = 8, importance = TRUE)
prediction_RF <- predict(model_RF, test, type = "class")
summary(model_RF)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 9235 factor numeric
## err.rate 5000 -none- numeric
## confusion 90 -none- numeric
## votes 83115 matrix numeric
## oob.times 9235 -none- numeric
## classes 9 -none- character
## importance 429 -none- numeric
## importanceSD 390 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 9235 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
fviz_pca_biplot(test.pca,
label = "all",
col.ind = missclassified(prediction_RF,test_y),
legend.title = "Players",
title = "Classification of labeled/misslabeled players for RF")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 2902 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
We generate a confusion matrix to check the mislabeled data.
#Evaluate the model performance
CrossTable(x=test_y, y=prediction_RF,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3958
##
##
## | prediction_RF
## test_y | CAM | CB | CDM | CM | LB | LW | RB | RW | ST | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CAM | 157 | 0 | 0 | 56 | 3 | 23 | 0 | 37 | 25 | 301 |
## | 0.522 | 0.000 | 0.000 | 0.186 | 0.010 | 0.076 | 0.000 | 0.123 | 0.083 | 0.076 |
## | 0.618 | 0.000 | 0.000 | 0.093 | 0.008 | 0.107 | 0.000 | 0.102 | 0.036 | |
## | 0.040 | 0.000 | 0.000 | 0.014 | 0.001 | 0.006 | 0.000 | 0.009 | 0.006 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CB | 0 | 665 | 24 | 0 | 20 | 0 | 16 | 0 | 0 | 725 |
## | 0.000 | 0.917 | 0.033 | 0.000 | 0.028 | 0.000 | 0.022 | 0.000 | 0.000 | 0.183 |
## | 0.000 | 0.890 | 0.071 | 0.000 | 0.054 | 0.000 | 0.043 | 0.000 | 0.000 | |
## | 0.000 | 0.168 | 0.006 | 0.000 | 0.005 | 0.000 | 0.004 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CDM | 0 | 48 | 240 | 89 | 5 | 0 | 9 | 0 | 0 | 391 |
## | 0.000 | 0.123 | 0.614 | 0.228 | 0.013 | 0.000 | 0.023 | 0.000 | 0.000 | 0.099 |
## | 0.000 | 0.064 | 0.706 | 0.147 | 0.013 | 0.000 | 0.024 | 0.000 | 0.000 | |
## | 0.000 | 0.012 | 0.061 | 0.022 | 0.001 | 0.000 | 0.002 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CM | 23 | 0 | 58 | 386 | 5 | 2 | 3 | 4 | 0 | 481 |
## | 0.048 | 0.000 | 0.121 | 0.802 | 0.010 | 0.004 | 0.006 | 0.008 | 0.000 | 0.122 |
## | 0.091 | 0.000 | 0.171 | 0.639 | 0.013 | 0.009 | 0.008 | 0.011 | 0.000 | |
## | 0.006 | 0.000 | 0.015 | 0.098 | 0.001 | 0.001 | 0.001 | 0.001 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LB | 0 | 17 | 2 | 9 | 318 | 1 | 18 | 3 | 0 | 368 |
## | 0.000 | 0.046 | 0.005 | 0.024 | 0.864 | 0.003 | 0.049 | 0.008 | 0.000 | 0.093 |
## | 0.000 | 0.023 | 0.006 | 0.015 | 0.853 | 0.005 | 0.049 | 0.008 | 0.000 | |
## | 0.000 | 0.004 | 0.001 | 0.002 | 0.080 | 0.000 | 0.005 | 0.001 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LW | 26 | 0 | 2 | 19 | 20 | 100 | 1 | 138 | 45 | 351 |
## | 0.074 | 0.000 | 0.006 | 0.054 | 0.057 | 0.285 | 0.003 | 0.393 | 0.128 | 0.089 |
## | 0.102 | 0.000 | 0.006 | 0.031 | 0.054 | 0.467 | 0.003 | 0.380 | 0.065 | |
## | 0.007 | 0.000 | 0.001 | 0.005 | 0.005 | 0.025 | 0.000 | 0.035 | 0.011 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RB | 0 | 17 | 14 | 13 | 2 | 0 | 302 | 3 | 0 | 351 |
## | 0.000 | 0.048 | 0.040 | 0.037 | 0.006 | 0.000 | 0.860 | 0.009 | 0.000 | 0.089 |
## | 0.000 | 0.023 | 0.041 | 0.022 | 0.005 | 0.000 | 0.818 | 0.008 | 0.000 | |
## | 0.000 | 0.004 | 0.004 | 0.003 | 0.001 | 0.000 | 0.076 | 0.001 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RW | 37 | 0 | 0 | 24 | 0 | 70 | 19 | 153 | 43 | 346 |
## | 0.107 | 0.000 | 0.000 | 0.069 | 0.000 | 0.202 | 0.055 | 0.442 | 0.124 | 0.087 |
## | 0.146 | 0.000 | 0.000 | 0.040 | 0.000 | 0.327 | 0.051 | 0.421 | 0.062 | |
## | 0.009 | 0.000 | 0.000 | 0.006 | 0.000 | 0.018 | 0.005 | 0.039 | 0.011 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## ST | 11 | 0 | 0 | 8 | 0 | 18 | 1 | 25 | 581 | 644 |
## | 0.017 | 0.000 | 0.000 | 0.012 | 0.000 | 0.028 | 0.002 | 0.039 | 0.902 | 0.163 |
## | 0.043 | 0.000 | 0.000 | 0.013 | 0.000 | 0.084 | 0.003 | 0.069 | 0.837 | |
## | 0.003 | 0.000 | 0.000 | 0.002 | 0.000 | 0.005 | 0.000 | 0.006 | 0.147 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 254 | 747 | 340 | 604 | 373 | 214 | 369 | 363 | 694 | 3958 |
## | 0.064 | 0.189 | 0.086 | 0.153 | 0.094 | 0.054 | 0.093 | 0.092 | 0.175 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
3.5 SVM
svm1 <- svm(formula= train_y~., data=train,
type="C-classification", kernel="radial",
gamma=0.1, cost=10)
We compute the accuracy on the validation set and produce a summary of the model.
prediction_svm <- predict(svm1,test, type = "class")
accuracy(table(test_y, prediction_svm))
## [1] 71.52602
summary(svm1)
##
## Call:
## svm(formula = train_y ~ ., data = train, type = "C-classification",
## kernel = "radial", gamma = 0.1, cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 7205
##
## ( 979 588 874 809 694 792 1071 806 592 )
##
##
## Number of Classes: 9
##
## Levels:
## CAM CB CDM CM LB LW RB RW ST
We plot the mislabeled data.
fviz_pca_biplot(test.pca,
label = "all",
col.ind = missclassified(prediction_svm,test_y),
legend.title = "Players",
title = "Classification of labeled/misslabeled players for SVM")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 2831 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
We generate a confusion matrix to check the mislabeled data.
#Evaluate the model performance
CrossTable(x=test_y, y=prediction_svm,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3958
##
##
## | prediction_svm
## test_y | CAM | CB | CDM | CM | LB | LW | RB | RW | ST | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CAM | 134 | 0 | 1 | 59 | 0 | 35 | 1 | 47 | 24 | 301 |
## | 0.445 | 0.000 | 0.003 | 0.196 | 0.000 | 0.116 | 0.003 | 0.156 | 0.080 | 0.076 |
## | 0.558 | 0.000 | 0.003 | 0.109 | 0.000 | 0.117 | 0.003 | 0.123 | 0.037 | |
## | 0.034 | 0.000 | 0.000 | 0.015 | 0.000 | 0.009 | 0.000 | 0.012 | 0.006 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CB | 0 | 660 | 26 | 3 | 19 | 0 | 17 | 0 | 0 | 725 |
## | 0.000 | 0.910 | 0.036 | 0.004 | 0.026 | 0.000 | 0.023 | 0.000 | 0.000 | 0.183 |
## | 0.000 | 0.862 | 0.073 | 0.006 | 0.051 | 0.000 | 0.048 | 0.000 | 0.000 | |
## | 0.000 | 0.167 | 0.007 | 0.001 | 0.005 | 0.000 | 0.004 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CDM | 0 | 50 | 240 | 88 | 6 | 0 | 7 | 0 | 0 | 391 |
## | 0.000 | 0.128 | 0.614 | 0.225 | 0.015 | 0.000 | 0.018 | 0.000 | 0.000 | 0.099 |
## | 0.000 | 0.065 | 0.676 | 0.163 | 0.016 | 0.000 | 0.020 | 0.000 | 0.000 | |
## | 0.000 | 0.013 | 0.061 | 0.022 | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CM | 37 | 4 | 75 | 349 | 3 | 6 | 0 | 5 | 2 | 481 |
## | 0.077 | 0.008 | 0.156 | 0.726 | 0.006 | 0.012 | 0.000 | 0.010 | 0.004 | 0.122 |
## | 0.154 | 0.005 | 0.211 | 0.647 | 0.008 | 0.020 | 0.000 | 0.013 | 0.003 | |
## | 0.009 | 0.001 | 0.019 | 0.088 | 0.001 | 0.002 | 0.000 | 0.001 | 0.001 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LB | 0 | 23 | 2 | 6 | 312 | 5 | 19 | 1 | 0 | 368 |
## | 0.000 | 0.062 | 0.005 | 0.016 | 0.848 | 0.014 | 0.052 | 0.003 | 0.000 | 0.093 |
## | 0.000 | 0.030 | 0.006 | 0.011 | 0.846 | 0.017 | 0.054 | 0.003 | 0.000 | |
## | 0.000 | 0.006 | 0.001 | 0.002 | 0.079 | 0.001 | 0.005 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LW | 31 | 1 | 2 | 11 | 19 | 121 | 2 | 126 | 38 | 351 |
## | 0.088 | 0.003 | 0.006 | 0.031 | 0.054 | 0.345 | 0.006 | 0.359 | 0.108 | 0.089 |
## | 0.129 | 0.001 | 0.006 | 0.020 | 0.051 | 0.406 | 0.006 | 0.330 | 0.058 | |
## | 0.008 | 0.000 | 0.001 | 0.003 | 0.005 | 0.031 | 0.001 | 0.032 | 0.010 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RB | 0 | 27 | 9 | 5 | 9 | 3 | 293 | 5 | 0 | 351 |
## | 0.000 | 0.077 | 0.026 | 0.014 | 0.026 | 0.009 | 0.835 | 0.014 | 0.000 | 0.089 |
## | 0.000 | 0.035 | 0.025 | 0.009 | 0.024 | 0.010 | 0.828 | 0.013 | 0.000 | |
## | 0.000 | 0.007 | 0.002 | 0.001 | 0.002 | 0.001 | 0.074 | 0.001 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RW | 28 | 0 | 0 | 12 | 1 | 94 | 14 | 164 | 33 | 346 |
## | 0.081 | 0.000 | 0.000 | 0.035 | 0.003 | 0.272 | 0.040 | 0.474 | 0.095 | 0.087 |
## | 0.117 | 0.000 | 0.000 | 0.022 | 0.003 | 0.315 | 0.040 | 0.429 | 0.050 | |
## | 0.007 | 0.000 | 0.000 | 0.003 | 0.000 | 0.024 | 0.004 | 0.041 | 0.008 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## ST | 10 | 1 | 0 | 6 | 0 | 34 | 1 | 34 | 558 | 644 |
## | 0.016 | 0.002 | 0.000 | 0.009 | 0.000 | 0.053 | 0.002 | 0.053 | 0.866 | 0.163 |
## | 0.042 | 0.001 | 0.000 | 0.011 | 0.000 | 0.114 | 0.003 | 0.089 | 0.852 | |
## | 0.003 | 0.000 | 0.000 | 0.002 | 0.000 | 0.009 | 0.000 | 0.009 | 0.141 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 240 | 766 | 355 | 539 | 369 | 298 | 354 | 382 | 655 | 3958 |
## | 0.061 | 0.194 | 0.090 | 0.136 | 0.093 | 0.075 | 0.089 | 0.097 | 0.165 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
3.6 Label Grouping
The accuracies obtained are decent but not great, and the confusion matrices explain why. Positions like CB, ST, LB and RB are classified really well. On the other hand, the most common confusions are CAM with CM, and RW with LW and vice versa.
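The per-position recall behind this observation can be recomputed from the KNN confusion matrix in section 3.3. The sketch below uses the diagonal counts and row totals copied from that table; the variable names are chosen for this example.

```r
positions <- c("CAM", "CB", "CDM", "CM", "LB", "LW", "RB", "RW", "ST")
correct <- c(114, 646, 210, 385, 331, 115, 287, 127, 563)     # diagonal of the KNN table
row_totals <- c(301, 725, 391, 481, 368, 351, 351, 346, 644)  # true players per position

# Recall: share of players of each position that were classified correctly
recall <- setNames(round(correct / row_totals, 3), positions)
sort(recall)

# Overall accuracy in percent
overall <- sum(correct) / sum(row_totals) * 100
```

The three lowest values belong to LW, RW and CAM, while CB, LB and ST are close to 0.9.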
The first misclassification can be explained by the basic attributes of the role. The Central Attacking Midfielder shares many attacking characteristics with the Winger, such as shooting and pace, but also many with the CM, such as passing.
The second one is a bit trickier. For Left Backs and Right Backs the preferred foot plays a big role: it is hard to find a right-footed player on the left side and vice versa, because full-backs cross and tackle mostly with their dominant foot. For RW and LW the preferred foot is less informative. On one hand, many right-footed players like to play as Left Winger so they can cut inside and shoot with their strong foot, and the same is true for left-footed players on the right. On the other hand, many wingers prefer to cross, so they tend to play on the side of their preferred foot (LW with the left, RW with the right). These differences depend on the individual player's style of play, so they are hard for a model to detect, and this explains the drop in accuracy for these positions. To improve the accuracy of our classifiers, we group RW and LW into a new position 'W' (Winger) and merge CAM into CM.
test_y2 <- test_y
levels(test_y2)[levels(test_y2) == "RW"| levels(test_y2) == "LW"] <- "W"
levels(test_y2)[levels(test_y2) == "CAM"| levels(test_y2) == "CM"] <- "CM"
train_y2 <- train_y
levels(train_y2)[levels(train_y2) == "RW"| levels(train_y2) == "LW"] <- "W"
levels(train_y2)[levels(train_y2) == "CAM"| levels(train_y2) == "CM"] <- "CM"
unique(test_y2)
## [1] ST W CDM CM RB CB LB
## Levels: CM CB CDM LB W RB ST
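Assigning the same name to several entries of levels() merges those levels. A tiny sketch of the mechanism on a toy factor (pos and merged are names used only here):

```r
pos <- factor(c("RW", "LW", "CAM", "CM", "ST", "RW"))
levels(pos)

merged <- pos
# Both winger levels receive the same new name and collapse into one level
levels(merged)[levels(merged) %in% c("RW", "LW")] <- "W"
# CAM is renamed to CM, which merges it with the existing CM level
levels(merged)[levels(merged) %in% c("CAM", "CM")] <- "CM"

levels(merged)
table(merged)
```

The number of observations is unchanged; only the labels become coarser.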
# plot the pie chart again
label_counts <- table(factor(test_y2))
pie(label_counts, col = hcl.colors(length(label_counts), "BluYl"))
This is the new distribution of labels. Now we repeat the same experiments, expecting a hefty increase in accuracy at the price of coarser labels.
3.6.1 Knn
prediction_knn2 <-knn(train = train, test = test, cl = train_y2, k = 20)
CrossTable(x=test_y2, y=prediction_knn2,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3958
##
##
## | prediction_knn2
## test_y2 | CM | CB | CDM | LB | W | RB | ST | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CM | 597 | 3 | 39 | 13 | 108 | 1 | 21 | 782 |
## | 0.763 | 0.004 | 0.050 | 0.017 | 0.138 | 0.001 | 0.027 | 0.198 |
## | 0.678 | 0.004 | 0.134 | 0.032 | 0.156 | 0.003 | 0.034 | |
## | 0.151 | 0.001 | 0.010 | 0.003 | 0.027 | 0.000 | 0.005 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CB | 2 | 651 | 28 | 24 | 0 | 20 | 0 | 725 |
## | 0.003 | 0.898 | 0.039 | 0.033 | 0.000 | 0.028 | 0.000 | 0.183 |
## | 0.002 | 0.905 | 0.096 | 0.059 | 0.000 | 0.058 | 0.000 | |
## | 0.001 | 0.164 | 0.007 | 0.006 | 0.000 | 0.005 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CDM | 121 | 36 | 205 | 14 | 0 | 15 | 0 | 391 |
## | 0.309 | 0.092 | 0.524 | 0.036 | 0.000 | 0.038 | 0.000 | 0.099 |
## | 0.137 | 0.050 | 0.702 | 0.034 | 0.000 | 0.043 | 0.000 | |
## | 0.031 | 0.009 | 0.052 | 0.004 | 0.000 | 0.004 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LB | 13 | 12 | 1 | 325 | 1 | 16 | 0 | 368 |
## | 0.035 | 0.033 | 0.003 | 0.883 | 0.003 | 0.043 | 0.000 | 0.093 |
## | 0.015 | 0.017 | 0.003 | 0.800 | 0.001 | 0.046 | 0.000 | |
## | 0.003 | 0.003 | 0.000 | 0.082 | 0.000 | 0.004 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## W | 99 | 0 | 2 | 28 | 493 | 14 | 61 | 697 |
## | 0.142 | 0.000 | 0.003 | 0.040 | 0.707 | 0.020 | 0.088 | 0.176 |
## | 0.112 | 0.000 | 0.007 | 0.069 | 0.712 | 0.040 | 0.098 | |
## | 0.025 | 0.000 | 0.001 | 0.007 | 0.125 | 0.004 | 0.015 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RB | 31 | 17 | 16 | 2 | 5 | 280 | 0 | 351 |
## | 0.088 | 0.048 | 0.046 | 0.006 | 0.014 | 0.798 | 0.000 | 0.089 |
## | 0.035 | 0.024 | 0.055 | 0.005 | 0.007 | 0.809 | 0.000 | |
## | 0.008 | 0.004 | 0.004 | 0.001 | 0.001 | 0.071 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## ST | 18 | 0 | 1 | 0 | 85 | 0 | 540 | 644 |
## | 0.028 | 0.000 | 0.002 | 0.000 | 0.132 | 0.000 | 0.839 | 0.163 |
## | 0.020 | 0.000 | 0.003 | 0.000 | 0.123 | 0.000 | 0.868 | |
## | 0.005 | 0.000 | 0.000 | 0.000 | 0.021 | 0.000 | 0.136 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 881 | 719 | 292 | 406 | 692 | 346 | 622 | 3958 |
## | 0.223 | 0.182 | 0.074 | 0.103 | 0.175 | 0.087 | 0.157 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
The confusion matrix looks much better: every class except CDM (0.524) now has a recall of at least 0.70.
accuracy(table(prediction_knn2, test_y2))
## [1] 78.095
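The per-class figures behind this overall accuracy can be read off the KNN table above: on the diagonal, N / Row Total is the recall and N / Col Total is the precision. The following is a minimal base R sketch with the counts copied from that table. Rows are taken to be the true classes, since the row totals match the class sizes of the test set. The names `cm`, `recall`, `precision` and `f1` are illustrative and not part of the original analysis.

```r
# Counts copied from the KNN2 cross table (rows = true class, columns = prediction)
classes <- c("CM", "CB", "CDM", "LB", "W", "RB", "ST")
cm <- matrix(c(
  597,   3,  39,  13, 108,   1,  21,
    2, 651,  28,  24,   0,  20,   0,
  121,  36, 205,  14,   0,  15,   0,
   13,  12,   1, 325,   1,  16,   0,
   99,   0,   2,  28, 493,  14,  61,
   31,  17,  16,   2,   5, 280,   0,
   18,   0,   1,   0,  85,   0, 540
), nrow = 7, byrow = TRUE, dimnames = list(actual = classes, predicted = classes))

recall    <- diag(cm) / rowSums(cm)   # share of each true class that is recovered
precision <- diag(cm) / colSums(cm)   # share of each predicted class that is right
f1        <- 2 * precision * recall / (precision + recall)
overall_accuracy <- 100 * sum(diag(cm)) / sum(cm)

print(round(cbind(precision, recall, f1), 3))
print(overall_accuracy)
```

CDM stands out with the lowest recall, because a large share of defensive midfielders is predicted as CM.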
fviz_pca_biplot(test.pca,
label = "all",
col.ind = missclassified(prediction_knn2, test_y2),
legend.title = "Players",
title = "Classification of labeled/mislabeled players for KNN2")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 3091 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
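The "invalid factor level, NA generated" warning means that a value outside the factor's levels (here `0`) was assigned to a factor, so the affected entries became `NA` and were dropped from the plot. The 3091 removed rows equal the number of correct predictions, which suggests that only the misclassified players are drawn. The helper below is a hypothetical alternative, not the `missclassified()` function used in this report. It works on character vectors and returns a factor without missing values, so that correctly classified players can be shown too.

```r
# Hypothetical helper (assumption: pred and label have the same length and
# use the same position codes); not the report's own missclassified()
flag_misclassified <- function(pred, label) {
  pred  <- as.character(pred)    # drop factor levels before comparing
  label <- as.character(label)
  status <- ifelse(pred == label, "correct", paste("misclassified as", pred))
  factor(status)
}

# Toy usage with made-up predictions
toy_pred  <- factor(c("CM", "W", "ST", "CB"))
toy_label <- factor(c("CM", "ST", "ST", "CDM"))
toy_flags <- flag_misclassified(toy_pred, toy_label)
print(table(toy_flags))
```

A factor built this way could be passed to `col.ind` in place of the current result.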
3.6.2 Random Forest
model_RF2 <- randomForest(train_y2 ~ ., data = train, ntree = 500, mtry = 8, importance = TRUE)
prediction_RF2 <- predict(model_RF2, test, type = "class")
summary(model_RF2)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 9235 factor numeric
## err.rate 4000 -none- numeric
## confusion 56 -none- numeric
## votes 64645 matrix numeric
## oob.times 9235 -none- numeric
## classes 7 -none- character
## importance 351 -none- numeric
## importanceSD 312 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 9235 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
accuracy(table(prediction_RF2, test_y2))
## [1] 80.29308
fviz_pca_biplot(test.pca,
label = "all",
col.ind = missclassified(prediction_RF2,test_y2),
legend.title = "Players",
title = "Classification of labeled/mislabeled players for RF2")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 3178 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
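We generate a confusion matrix to check for mislabeled data. The rows use the original nine-position labels (test_y), so the merged classes CM and W each collect several original positions.
# Evaluate the model performance
CrossTable(x = test_y, y = prediction_RF2, prop.chisq = FALSE)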
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3958
##
##
## | prediction_RF2
## test_y | CM | CB | CDM | LB | W | RB | ST | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CAM | 183 | 0 | 0 | 0 | 98 | 0 | 20 | 301 |
## | 0.608 | 0.000 | 0.000 | 0.000 | 0.326 | 0.000 | 0.066 | 0.076 |
## | 0.223 | 0.000 | 0.000 | 0.000 | 0.138 | 0.000 | 0.032 | |
## | 0.046 | 0.000 | 0.000 | 0.000 | 0.025 | 0.000 | 0.005 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CB | 1 | 662 | 24 | 21 | 0 | 17 | 0 | 725 |
## | 0.001 | 0.913 | 0.033 | 0.029 | 0.000 | 0.023 | 0.000 | 0.183 |
## | 0.001 | 0.886 | 0.072 | 0.058 | 0.000 | 0.047 | 0.000 | |
## | 0.000 | 0.167 | 0.006 | 0.005 | 0.000 | 0.004 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CDM | 93 | 48 | 234 | 5 | 1 | 10 | 0 | 391 |
## | 0.238 | 0.123 | 0.598 | 0.013 | 0.003 | 0.026 | 0.000 | 0.099 |
## | 0.114 | 0.064 | 0.705 | 0.014 | 0.001 | 0.028 | 0.000 | |
## | 0.023 | 0.012 | 0.059 | 0.001 | 0.000 | 0.003 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CM | 414 | 0 | 54 | 4 | 6 | 3 | 0 | 481 |
## | 0.861 | 0.000 | 0.112 | 0.008 | 0.012 | 0.006 | 0.000 | 0.122 |
## | 0.505 | 0.000 | 0.163 | 0.011 | 0.008 | 0.008 | 0.000 | |
## | 0.105 | 0.000 | 0.014 | 0.001 | 0.002 | 0.001 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LB | 10 | 19 | 3 | 312 | 6 | 18 | 0 | 368 |
## | 0.027 | 0.052 | 0.008 | 0.848 | 0.016 | 0.049 | 0.000 | 0.093 |
## | 0.012 | 0.025 | 0.009 | 0.864 | 0.008 | 0.050 | 0.000 | |
## | 0.003 | 0.005 | 0.001 | 0.079 | 0.002 | 0.005 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LW | 37 | 0 | 2 | 16 | 269 | 1 | 26 | 351 |
## | 0.105 | 0.000 | 0.006 | 0.046 | 0.766 | 0.003 | 0.074 | 0.089 |
## | 0.045 | 0.000 | 0.006 | 0.044 | 0.378 | 0.003 | 0.042 | |
## | 0.009 | 0.000 | 0.001 | 0.004 | 0.068 | 0.000 | 0.007 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RB | 14 | 18 | 15 | 2 | 4 | 298 | 0 | 351 |
## | 0.040 | 0.051 | 0.043 | 0.006 | 0.011 | 0.849 | 0.000 | 0.089 |
## | 0.017 | 0.024 | 0.045 | 0.006 | 0.006 | 0.821 | 0.000 | |
## | 0.004 | 0.005 | 0.004 | 0.001 | 0.001 | 0.075 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RW | 51 | 0 | 0 | 1 | 253 | 15 | 26 | 346 |
## | 0.147 | 0.000 | 0.000 | 0.003 | 0.731 | 0.043 | 0.075 | 0.087 |
## | 0.062 | 0.000 | 0.000 | 0.003 | 0.356 | 0.041 | 0.042 | |
## | 0.013 | 0.000 | 0.000 | 0.000 | 0.064 | 0.004 | 0.007 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## ST | 16 | 0 | 0 | 0 | 74 | 1 | 553 | 644 |
## | 0.025 | 0.000 | 0.000 | 0.000 | 0.115 | 0.002 | 0.859 | 0.163 |
## | 0.020 | 0.000 | 0.000 | 0.000 | 0.104 | 0.003 | 0.885 | |
## | 0.004 | 0.000 | 0.000 | 0.000 | 0.019 | 0.000 | 0.140 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 819 | 747 | 332 | 361 | 711 | 363 | 625 | 3958 |
## | 0.207 | 0.189 | 0.084 | 0.091 | 0.180 | 0.092 | 0.158 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
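The table above has nine rows (original positions) and seven columns (merged classes). To compare it with the reported accuracy, the rows can be collapsed onto the merged classes. The sketch below does this in base R with the counts copied from the table. The mapping CAM to CM and LW/RW to W is inferred from the row totals (301 + 481 = 782 and 351 + 346 = 697), and all variable names are illustrative.

```r
# Counts copied from the RF2 cross table (rows = original position, columns = prediction)
pred_classes <- c("CM", "CB", "CDM", "LB", "W", "RB", "ST")
orig_classes <- c("CAM", "CB", "CDM", "CM", "LB", "LW", "RB", "RW", "ST")
rf_counts <- matrix(c(
  183,   0,   0,   0,  98,   0,  20,
    1, 662,  24,  21,   0,  17,   0,
   93,  48, 234,   5,   1,  10,   0,
  414,   0,  54,   4,   6,   3,   0,
   10,  19,   3, 312,   6,  18,   0,
   37,   0,   2,  16, 269,   1,  26,
   14,  18,  15,   2,   4, 298,   0,
   51,   0,   0,   1, 253,  15,  26,
   16,   0,   0,   0,  74,   1, 553
), nrow = 9, byrow = TRUE, dimnames = list(orig_classes, pred_classes))

# Assumed mapping from the nine original positions to the seven merged classes
merged_label <- c(CAM = "CM", CB = "CB", CDM = "CDM", CM = "CM", LB = "LB",
                  LW = "W", RB = "RB", RW = "W", ST = "ST")

# Sum the rows that share a merged class, then align rows with columns
collapsed <- rowsum(rf_counts, group = merged_label[rownames(rf_counts)])
collapsed <- collapsed[pred_classes, pred_classes]

accuracy_pct <- 100 * sum(diag(collapsed)) / sum(collapsed)
cam_as_w <- rf_counts["CAM", "W"] / sum(rf_counts["CAM", ])
print(collapsed)
print(accuracy_pct)
```

The collapsed table reproduces the reported accuracy, and `cam_as_w` shows that roughly a third of the attacking midfielders are predicted as wingers.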
3.6.3 SVM
svm2 <- svm(formula= train_y2~., data=train,
type="C-classification", kernel="radial",
gamma=0.1, cost=10)
prediction_svm2 <- predict(svm2, test, type = "class")
summary(svm2)
##
## Call:
## svm(formula = train_y2 ~ ., data = train, type = "C-classification",
## kernel = "radial", gamma = 0.1, cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 6667
##
## ( 978 585 1339 808 690 1487 780 )
##
##
## Number of Classes: 7
##
## Levels:
## CM CB CDM LB W RB ST
accuracy(table(test_y2, prediction_svm2))
## [1] 79.66145
fviz_pca_biplot(test.pca,
label = "all",
col.ind = missclassified(prediction_svm2, test_y2),
legend.title = "Players",
title = "Classification of labeled/mislabeled players for SVM2")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 3178 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
We generate a confusion matrix to check for mislabeled data. As before, the rows use the original nine-position labels (test_y).
# Evaluate the model performance
CrossTable(x = test_y, y = prediction_svm2, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3958
##
##
## | prediction_svm2
## test_y | CM | CB | CDM | LB | W | RB | ST | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CAM | 199 | 0 | 1 | 0 | 84 | 1 | 16 | 301 |
## | 0.661 | 0.000 | 0.003 | 0.000 | 0.279 | 0.003 | 0.053 | 0.076 |
## | 0.248 | 0.000 | 0.003 | 0.000 | 0.116 | 0.003 | 0.026 | |
## | 0.050 | 0.000 | 0.000 | 0.000 | 0.021 | 0.000 | 0.004 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CB | 4 | 659 | 26 | 19 | 0 | 17 | 0 | 725 |
## | 0.006 | 0.909 | 0.036 | 0.026 | 0.000 | 0.023 | 0.000 | 0.183 |
## | 0.005 | 0.866 | 0.077 | 0.052 | 0.000 | 0.049 | 0.000 | |
## | 0.001 | 0.166 | 0.007 | 0.005 | 0.000 | 0.004 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CDM | 97 | 50 | 231 | 5 | 1 | 7 | 0 | 391 |
## | 0.248 | 0.128 | 0.591 | 0.013 | 0.003 | 0.018 | 0.000 | 0.099 |
## | 0.121 | 0.066 | 0.681 | 0.014 | 0.001 | 0.020 | 0.000 | |
## | 0.025 | 0.013 | 0.058 | 0.001 | 0.000 | 0.002 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## CM | 394 | 3 | 69 | 3 | 11 | 0 | 1 | 481 |
## | 0.819 | 0.006 | 0.143 | 0.006 | 0.023 | 0.000 | 0.002 | 0.122 |
## | 0.491 | 0.004 | 0.204 | 0.008 | 0.015 | 0.000 | 0.002 | |
## | 0.100 | 0.001 | 0.017 | 0.001 | 0.003 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LB | 6 | 23 | 2 | 311 | 9 | 17 | 0 | 368 |
## | 0.016 | 0.062 | 0.005 | 0.845 | 0.024 | 0.046 | 0.000 | 0.093 |
## | 0.007 | 0.030 | 0.006 | 0.852 | 0.012 | 0.049 | 0.000 | |
## | 0.002 | 0.006 | 0.001 | 0.079 | 0.002 | 0.004 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## LW | 40 | 0 | 2 | 17 | 259 | 1 | 32 | 351 |
## | 0.114 | 0.000 | 0.006 | 0.048 | 0.738 | 0.003 | 0.091 | 0.089 |
## | 0.050 | 0.000 | 0.006 | 0.047 | 0.357 | 0.003 | 0.052 | |
## | 0.010 | 0.000 | 0.001 | 0.004 | 0.065 | 0.000 | 0.008 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RB | 6 | 25 | 8 | 10 | 11 | 291 | 0 | 351 |
## | 0.017 | 0.071 | 0.023 | 0.028 | 0.031 | 0.829 | 0.000 | 0.089 |
## | 0.007 | 0.033 | 0.024 | 0.027 | 0.015 | 0.834 | 0.000 | |
## | 0.002 | 0.006 | 0.002 | 0.003 | 0.003 | 0.074 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## RW | 37 | 0 | 0 | 0 | 268 | 14 | 27 | 346 |
## | 0.107 | 0.000 | 0.000 | 0.000 | 0.775 | 0.040 | 0.078 | 0.087 |
## | 0.046 | 0.000 | 0.000 | 0.000 | 0.370 | 0.040 | 0.044 | |
## | 0.009 | 0.000 | 0.000 | 0.000 | 0.068 | 0.004 | 0.007 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## ST | 19 | 1 | 0 | 0 | 82 | 1 | 541 | 644 |
## | 0.030 | 0.002 | 0.000 | 0.000 | 0.127 | 0.002 | 0.840 | 0.163 |
## | 0.024 | 0.001 | 0.000 | 0.000 | 0.113 | 0.003 | 0.877 | |
## | 0.005 | 0.000 | 0.000 | 0.000 | 0.021 | 0.000 | 0.137 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 802 | 761 | 339 | 365 | 725 | 349 | 617 | 3958 |
## | 0.203 | 0.192 | 0.086 | 0.092 | 0.183 | 0.088 | 0.156 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
4. Conclusion and further research
All in all, position classification works well for some distinct areas of the football field, but in the multiclass setting it is nearly impossible for certain positions. We also tried dedicated models for RW vs. LW and for CM vs. CAM, but the results we obtained were not far from random. This is because many footballers have the attributes needed to play equally well in several positions. A multilabel approach over all player positions would likely improve classification.
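As a rough illustration of what a multilabel setup could look like, the sketch below uses binary relevance: one independent yes/no classifier per position, so that a player may be assigned to several positions at once. Everything here is an assumption made for illustration: the data are synthetic, the two attributes and two positions are placeholders, and logistic regression stands in for any of the models used above.

```r
set.seed(1)
n <- 200

# Synthetic players with two made-up attributes (not the FIFA data)
players <- data.frame(pace = rnorm(n), shooting = rnorm(n))

# Hypothetical multilabel targets: a player can be eligible for W, ST, both or neither
players$W  <- as.integer(players$pace     + 0.3 * rnorm(n) > 0)
players$ST <- as.integer(players$shooting + 0.3 * rnorm(n) > 0)

positions <- c("W", "ST")

# Binary relevance: fit one logistic regression per position
models <- lapply(positions, function(pos) {
  glm(reformulate(c("pace", "shooting"), response = pos),
      data = players, family = binomial)
})
names(models) <- positions

# One probability column per position; a player gets every label above 0.5
probs     <- sapply(models, predict, newdata = players, type = "response")
predicted <- probs > 0.5

print(head(predicted))
print(sum(rowSums(predicted) == 2))   # players predicted for both positions
```

The design choice is that no position competes with another, which matches the observation that many players fit several roles equally well.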
On the one hand, football is a very heterogeneous sport, and attribute values often cannot fully explain a player's position: his style of play heavily influences how the role is interpreted and, consequently, where exactly he acts on the field. On the other hand, we would like to believe that, with sufficient data, even the effective positioning of real players could be estimated.